Distributed Higher Order Text Mining

Authors

  • Shenzhi LI
  • Christopher D. JANNECK
  • Xiaoning YANG
  • Aditya BELAPURKAR
  • Osama A. KHAN
  • Tianhao WU
  • Murat GANIZ
  • Mark D. DILSIZIAN
Abstract

The burgeoning amount of textual data in distributed sources combined with the obstacles involved in creating and maintaining central repositories motivates the need for effective distributed information extraction and mining techniques. Recently, as the need to mine patterns across distributed databases has grown, Distributed Association Rule Mining (D-ARM) algorithms have been developed. These algorithms, however, assume that the databases are either horizontally or vertically distributed. In the special case of databases populated from information extracted from textual data, existing D-ARM algorithms cannot discover rules based on higher-order associations between items in distributed textual documents that are neither vertically nor horizontally distributed, but rather a hybrid of the two. In this article we present D-HOTM, a framework and system for Distributed Higher Order Text Mining. Unlike existing algorithms, those encapsulated in D-HOTM require neither full knowledge of the global schema nor that the distribution of data be horizontal or vertical. D-HOTM discovers rules based on higher-order associations between distributed database records containing the extracted entities. A theoretical framework for reasoning about record linkage is provided to support the discovery of higher-order associations. In order to handle record linkage, the traditional evaluation metrics employed in ARM are extended. The implementation of D-HOTM is based on the TMI and tested on a cluster at the National Center for Supercomputing Applications (NCSA). A sample manual run, results of experimental runs on the NCSA clusters, and theoretical comparisons demonstrate the performance and relevance of D-HOTM in e-marketplaces, law enforcement and homeland defense.
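
The idea of a higher-order association can be illustrated with a small sketch. It assumes one common reading of the term: two items never appear in the same record, but the records containing them are linked through a shared item. The function name `second_order_pairs`, the record identifiers, and the sample entities are all hypothetical and are not taken from D-HOTM itself; the abstract does not specify the system's actual linkage rules or metrics.

```python
from itertools import combinations


def second_order_pairs(records):
    """Map item pairs that never share a record to the items that link them.

    `records` maps a record id to the set of entities extracted for it.
    This is an illustrative assumption, not the D-HOTM algorithm.
    """
    # First-order associations: pairs that co-occur inside a single record.
    first_order = set()
    for items in records.values():
        first_order.update(frozenset(p) for p in combinations(sorted(items), 2))

    # Link two records when they share an item, then pair the remaining
    # items across the linked records.
    second_order = {}
    for (_, items_a), (_, items_b) in combinations(sorted(records.items()), 2):
        shared = items_a & items_b
        if not shared:
            continue
        for x in items_a - shared:
            for y in items_b - shared:
                pair = frozenset((x, y))
                if pair not in first_order:
                    second_order.setdefault(pair, set()).update(shared)
    return second_order


# Hypothetical records held at two different sites.
records = {
    "site1:r1": {"Smith", "blue sedan"},
    "site2:r7": {"blue sedan", "Main St"},
    "site2:r9": {"Jones", "Elm St"},
}
links = second_order_pairs(records)
for pair, via in links.items():
    print(sorted(pair), "linked via", sorted(via))
```

Here "Smith" and "Main St" never appear together, yet the two records that mention them share "blue sedan", so the pair is reported as linked through that item. The third record shares nothing with the others and contributes no pair.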

Similar resources

D-HOTM: distributed higher order text mining

We present D-HOTM, a framework for Distributed Higher Order Text Mining based on named entities extracted from textual data that are stored in distributed relational databases. Unlike existing algorithms, D-HOTM requires neither full knowledge of the global schema nor that the distribution of data be horizontal or vertical. D-HOTM discovers rules based on higher-order associations between distr...

A very-short-text clustering method based on distributed representation to identifying research capabilities of a Higher Education Institution

Purpose. Text documents are an important source of data for tech mining techniques. Text databases usually contain documents long enough for conventional text mining techniques to apply. However, in some tech mining tasks, such as the capabilities identification process, the database contains very short texts, which pose a challenge for conventional text mining techniques. The problem has to d...

MOMEMI: Modern Methods of Data Mining

Modern data mining is used to classify and to discover relationships in big data sets. The papers presented in the framework of MOMEMI deal with the most important fields of modern data mining: determination and use of patterns and templates, incremental reasoning, geometrical associations, and text mining. Keywords: data mining; classification; forecast; cluster; association...

Integrating Biomedical Text Mining Services into a Distributed Workflow Environment

Workflows are useful ways to support scientific researchers in carrying out repetitive analytical tasks on digital information. Web services can provide a useful implementation mechanism for workflows, particularly when they are distributed, i.e., where some of the data or processing resources are remote from the scientist initiating the workflow. While many scientific workflows primarily invol...

Distributed Computation of Generalized One-Sided Concept Lattices on Sparse Data Tables

In this paper we present a study on the use of a distributed version of the algorithm for generalized one-sided concept lattices (GOSCL), which provides a special case of the fuzzy version of the data analysis approach called formal concept analysis (FCA). Methods of this type create a conceptual model of the input data based on the theory of concept lattices a...

Publication date: 2006